feat: Parquet reader builder supports building multiple ranges to read #3841
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3841 +/- ##
==========================================
- Coverage 85.60% 85.32% -0.28%
==========================================
Files 952 953 +1
Lines 163079 163431 +352
==========================================
- Hits 139596 139451 -145
- Misses 23483 23980 +497
LGTM
LGTM
I hereby agree to the terms of the GreptimeDB CLA.
Refer to a related PR or issue link (optional)
What's changed and what's your intention?
This PR defines the `FileRange` struct for parquet files and implements a method `build_file_ranges()` to build ranges from a parquet file. A `FileRange` contains a range of rows to read from a parquet file, so different ranges can be read in parallel later. For now, a `FileRange` is exactly one row group.

To reuse code, this PR refactors the `ParquetReader`. It now adds a `RowGroupReader` to read `Batches` from a row group, and the `ParquetReader` invokes the `RowGroupReader` to read the parquet file. It also introduces a `FileRangeContext` for all ranges of the same parquet file. This PR also moves the `precise_filter()` method to the context, as the context contains all inputs the method needs.

Now the builder uses a method `build_reader_input()` to construct the `FileRangeContext` and the row groups to read. Both `build()` and `build_file_ranges()` can reuse this method.

This PR also fixes some remaining issues and improves the `ReadFormat` helper struct:
- Computes `projection_indices` in advance.
- Computes `field_id_to_projected_index` in advance, so `convert_record_batch()` doesn't require `&mut self`. This is necessary for the `FileRangeContext`, as we have to share the `ReadFormat`.
- Wraps the `SimpleFilterEvaluator` into a `SimpleFilterContext`. The context gets the column info in advance from the expected region metadata. This ensures the context can use the correct column id to find the column in the `ReadFormat`.
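
To illustrate the range design described above, here is a minimal, self-contained sketch. It is not the code in this PR: the field names, the function signatures, and the "skip empty row groups" pruning rule are assumptions made only for illustration, and the real `FileRangeContext` holds more state (such as the reader builder and filters). It shows one shared context per file, one `FileRange` per selected row group, and a `build_reader_input()`-style helper that the range builder reuses.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the state shared by all ranges of one file.
#[derive(Debug)]
struct FileRangeContext {
    file_path: String,
    /// Number of rows in each row group of the file.
    row_group_rows: Vec<usize>,
}

/// A range of rows to read from a parquet file. Here it is exactly one row group.
#[derive(Debug, Clone)]
struct FileRange {
    context: Arc<FileRangeContext>,
    row_group_idx: usize,
}

impl FileRange {
    /// Rows covered by this range.
    fn num_rows(&self) -> usize {
        self.context.row_group_rows[self.row_group_idx]
    }
}

/// Shared step: build the context and pick the row groups to read.
/// Assumption: we only skip empty row groups; a real reader would prune
/// with statistics or indexes instead.
fn build_reader_input(
    file_path: &str,
    row_group_rows: Vec<usize>,
) -> (Arc<FileRangeContext>, Vec<usize>) {
    let selected: Vec<usize> = row_group_rows
        .iter()
        .enumerate()
        .filter(|(_, rows)| **rows > 0)
        .map(|(idx, _)| idx)
        .collect();
    let context = Arc::new(FileRangeContext {
        file_path: file_path.to_string(),
        row_group_rows,
    });
    (context, selected)
}

/// Builds one `FileRange` per selected row group; all ranges share the context.
fn build_file_ranges(file_path: &str, row_group_rows: Vec<usize>) -> Vec<FileRange> {
    let (context, selected) = build_reader_input(file_path, row_group_rows);
    selected
        .into_iter()
        .map(|row_group_idx| FileRange {
            context: Arc::clone(&context),
            row_group_idx,
        })
        .collect()
}

/// Processes every range on its own thread and sums the row counts.
fn count_rows_in_parallel(ranges: Vec<FileRange>) -> usize {
    let handles: Vec<_> = ranges
        .into_iter()
        .map(|range| std::thread::spawn(move || range.num_rows()))
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("range thread panicked"))
        .sum()
}
```

Because each range only holds an `Arc` to the context plus a row group index, ranges are cheap to clone and can be handed to separate threads without copying per-file state.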
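
The `ReadFormat` change can be sketched in the same spirit. The struct below is a simplified, hypothetical model rather than the actual helper: the `ColumnId` alias, the constructor arguments, and the `projected_index()` and `lookup_from_threads()` functions are assumptions. The point it illustrates is that once both lookup tables are computed in the constructor, later lookups need only `&self`, so a single instance can be shared behind an `Arc`.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical column id type for this sketch.
type ColumnId = u32;

/// Simplified model of a read-format helper with precomputed lookups.
struct ReadFormat {
    /// Indices (in file order) of the columns to read.
    projection_indices: Vec<usize>,
    /// Maps a column id to its position among the projected columns.
    field_id_to_projected_index: HashMap<ColumnId, usize>,
}

impl ReadFormat {
    /// `file_columns` lists column ids in file order; `projected` lists ids to read.
    fn new(file_columns: &[ColumnId], projected: &[ColumnId]) -> Self {
        // Computed once, up front, instead of lazily on first use.
        let projection_indices: Vec<usize> = file_columns
            .iter()
            .enumerate()
            .filter(|(_, id)| projected.contains(*id))
            .map(|(idx, _)| idx)
            .collect();
        let field_id_to_projected_index = projection_indices
            .iter()
            .enumerate()
            .map(|(projected_idx, file_idx)| (file_columns[*file_idx], projected_idx))
            .collect();
        Self {
            projection_indices,
            field_id_to_projected_index,
        }
    }

    /// Read-only lookup: no `&mut self` needed because nothing is cached lazily.
    fn projected_index(&self, column_id: ColumnId) -> Option<usize> {
        self.field_id_to_projected_index.get(&column_id).copied()
    }
}

/// Shares one `ReadFormat` across threads, one lookup per thread.
fn lookup_from_threads(format: Arc<ReadFormat>, ids: Vec<ColumnId>) -> Vec<Option<usize>> {
    let handles: Vec<_> = ids
        .into_iter()
        .map(|id| {
            let format = Arc::clone(&format);
            std::thread::spawn(move || format.projected_index(id))
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("lookup thread panicked"))
        .collect()
}
```

If the maps were filled lazily inside the conversion method instead, that method would need `&mut self`, and sharing the helper would require a lock or one copy per range.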
Checklist